Systems and methods for transferring data to maintain preferred slot positions in a bi-endian processor

ABSTRACT

A bi-endian multiprocessor system having multiple processing elements, each of which includes a processor core, a local memory and a memory flow controller. The memory flow controller transfers data between the local memory and data sources external to the processing element. If the processing element and the data source implement data representations having the same endian-ness, each multi-word line of data is stored in the local memory in the same word order as in the data source. If the processing element and the data source implement data representations having different endian-ness, the words of each multi-word line of data are transposed when data is transferred between local memory and the data source. The processing element may incorporate circuitry to add doublewords, wherein the circuitry can alternately carry bits from a first word to a second word or vice versa, depending upon whether the words in lines of data are transposed.

BACKGROUND

1. Field of the Invention

The invention relates generally to computer systems and more particularly to systems and methods for enabling computer systems that implement data representations having a particular “endian-ness” to operate in conjunction with data sources that may have either the same endian-ness or different endian-ness.

2. Related Art

Modern computer systems provide ever-increasing capacity to process data. This capacity is expanded by making processors faster, smaller and more efficient. Often, computer systems will implement several processor cores to process data in parallel. The increased computing capability of these computers, however, is meaningless if the computers cannot properly communicate the data that is processed.

One of the issues that can cause problems in the communication of data is the manner in which data is represented. Computer systems typically represent data as a series of binary digits (1's and 0's). A byte (8 bits) of data can represent a decimal value from 0 to 255. For instance, the bit string “00000001” represents the decimal value “1”.

In order for a computer system to properly interpret this bit string, however, it is necessary to know which bit is the first bit in the string, or if a data word contains multiple bytes, which byte is the first byte. Some computer systems store data beginning with the least significant byte (a “little-endian” data representation), while others store data beginning with the most significant byte (a “big-endian” data representation). If the same computer reads and writes the data, it will be properly interpreted, because only a single data representation is used. If different computer systems (or other devices) that use different data representations are used to write and read the data, however, care must be taken to ensure that the data is properly interpreted by both systems. If a data word is interpreted beginning with the wrong end of the word, the value of the data word will be misinterpreted.
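The effect of reading a word from the wrong end can be illustrated with a minimal sketch. The 32-bit value and the variable names below are chosen purely for illustration and do not appear in the described system.

```python
# Illustrative only: the value and the 4-byte word size are assumptions.
value = 0x0A0B0C0D

# Big-endian: most significant byte first. Little-endian: least significant first.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

# Reading the bytes with the same convention that wrote them recovers the value.
same_order = int.from_bytes(big, byteorder="big")

# Reading them with the opposite convention yields a different value.
wrong_order = int.from_bytes(big, byteorder="little")

print(big.hex(), little.hex(), hex(same_order), hex(wrong_order))
```

Here the same four bytes are interpreted as 0x0A0B0C0D by a reader that shares the writer's convention and as 0x0D0C0B0A by one that does not.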

Another issue that may cause problems in the communication of data between systems that use different data representations relates to the use of SIMD (single instruction, multiple data) instructions. SIMD-type instructions include multiple words that are interpreted as both control information and data. For instance, one system might use 128-bit lines of data (four 32-bit words) that can be interpreted as control information and three data words. In one instance, the line of data could be interpreted as a one-word shift value or address, with three words of data. In another alternative, the line of data could be interpreted as four words of data, one or more of which will be replaced by the result of a computation involving the four data words.

Typically, the word that contains the control scalar value (or shift amount, address, computation result, etc.) is identified by its position in a “preferred slot”. A preferred slot is simply a designated position in which this type of data is stored. The location of the preferred slot may depend upon its size; for instance, in a 128-bit line of data, it may be a designated byte, halfword, 32-bit word, doubleword or quadword.

One difficulty that may arise in relation to preferred slots is that, in a processor that accommodates both big-endian and little-endian data sources, the positions of the preferred slots may change. The changing of the preferred slots' positions is typically handled by incorporating multiplexers into the design of the processor, so that the preferred slot data can be read from or written to the different positions that the preferred slots may occupy. These multiplexers add complexity to the design, increase the space required for the processor, increase the delay in accessing the preferred slot data and potentially limit the maximum operating speed of the processor. It would therefore be desirable to provide systems and methods for accommodating the changing preferred slot positions resulting from changing data representations without having to use these multiplexers.

SUMMARY OF THE INVENTION

One or more of the problems outlined above may be solved by the various embodiments of the invention. Broadly speaking, the invention includes systems and methods for accommodating changing preferred slot positions in a bi-endian processor system by transposing words within lines of data when the lines of data are transferred between the processor, which uses a first data representation, and a data source which uses a different data representation.

In one embodiment, a multiprocessor system includes multiple processing elements. Each processing element includes a processor core, a local memory dedicated for use by the processor core, and a memory flow controller configured to transfer data between the local memory and an external data source. The processing element may also include a local cache memory which is separate from the local memory. The memory flow controller of each processing element is configured to transfer multiple-word lines of data between the processing element and the data source in either of two modes. In a first mode that is used when the endian-ness of the processing element is the same as the data source, the memory flow controller transfers each line of data without transposing the words in each line of data. In a second mode that is used when the endian-ness of the processing element is different from the data source, the memory flow controller transposes the words in each line of data when the memory flow controller transfers the line of data. The processing element may also include logic circuitry that is configured to add doublewords in the lines of data. This logic circuitry carries bits from the first words of the doublewords to the second words of the doublewords, or from the second words to the first words, depending upon the endian-ness of the processing element and the data source.
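The direction-dependent carry in the doubleword addition can be sketched in software as follows. This is only an illustrative model, not the logic circuitry itself: the function name add_doubleword, the flag low_word_first, and the representation of a doubleword as a two-element list of 32-bit words are assumptions introduced here.

```python
MASK32 = 0xFFFFFFFF

def add_doubleword(a, b, low_word_first):
    """Add two 64-bit values, each held as a pair of 32-bit words.

    low_word_first (hypothetical flag) states which word of the pair is the
    less significant one, and therefore which way the carry must travel.
    """
    lo, hi = (0, 1) if low_word_first else (1, 0)
    low_sum = a[lo] + b[lo]
    carry = low_sum >> 32  # 1 if the low-word addition overflowed 32 bits
    result = [0, 0]
    result[lo] = low_sum & MASK32
    result[hi] = (a[hi] + b[hi] + carry) & MASK32
    return result

# Carry travels from the first word to the second word.
first_to_second = add_doubleword([0xFFFFFFFF, 0x1], [0x1, 0x0], low_word_first=True)

# Same values with the words transposed: carry travels from second to first.
second_to_first = add_doubleword([0x1, 0xFFFFFFFF], [0x0, 0x1], low_word_first=False)
```

In both calls the 64-bit operands are 0x1FFFFFFFF and 0x1; only the word order, and hence the carry direction, differs.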

Numerous additional embodiments are also possible.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention may become apparent upon reading the following detailed description and upon reference to the accompanying drawings.

FIG. 1 is a functional block diagram illustrating the interconnection of a computer processor with cache memories and a main memory in accordance with the prior art.

FIG. 2 is a functional block diagram illustrating a multiprocessor computer system in accordance with the prior art.

FIG. 3 is a block diagram illustrating the structure of a multiprocessor system that includes both a primary processor core that is linked to the main memory using a conventional memory hierarchy, and multiple processing elements that include local memories and cache memories that are linked to the main memory in accordance with one embodiment.

FIG. 4 is a block diagram illustrating the structure of a multiprocessor system that includes both a primary processor core and eight processing elements in accordance with one embodiment.

FIG. 5 is a diagram illustrating the locations of preferred slots in a 16-byte data line in accordance with one embodiment.

FIG. 6 is a functional block diagram illustrating the structure of a processor core within a bi-endian multiprocessor system in accordance with the prior art.

FIG. 7 is a functional block diagram illustrating the structure of a processing element in accordance with one embodiment.

FIG. 8 includes a pair of diagrams illustrating the modes of operation of a memory flow controller in accordance with one embodiment.

FIG. 9 is a functional block diagram of a logic circuit configured to properly carry a bit from the least significant word of a doubleword to the most significant word in accordance with one embodiment.

While the invention is subject to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and the accompanying detailed description. It should be understood that the drawings and detailed description are not intended to limit the invention to the particular embodiments which are described. This disclosure is instead intended to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

One or more embodiments of the invention are described below. It should be noted that these and any other embodiments described below are exemplary and are intended to be illustrative of the invention rather than limiting.

Broadly speaking, the invention includes systems and methods for accommodating changing preferred slot positions in a bi-endian processor system by transposing words within lines of data when the lines of data are transferred between the processor, which uses a first data representation, and a data source which uses a different data representation.

Conventional computer systems typically employ a memory system that includes not only a main memory, but also one or more cache memories. A typical memory hierarchy is illustrated in FIG. 1. FIG. 1 is a functional block diagram that shows the interconnection of a computer processor with cache memories and a main memory. Processor 110 is coupled to a first cache memory 120, which is typically referred to as the level 1, or L1 cache. The L1 cache is, in turn, coupled to cache memory 130, which is referred to as the level 2, or L2 cache. L2 cache 130 is coupled to main memory 150.

Main memory 150 may be capable of storing up to four gigabytes of data, but it typically requires multiple processor cycles to perform each data access to the main memory. Cache memories 120 and 130 are provided in order to reduce the latency of these data accesses. Each of the cache memories is substantially smaller than the main memory, but they can be accessed more quickly (with lower data latency) than the main memory. Each successive cache memory is normally larger than the preceding cache memory, and has a higher data latency than the preceding cache memory. Thus, for example, L1 cache 120 may only be capable of storing eight or sixteen kilobytes of data, but the data stored in this cache may be accessible in a single processor cycle. L2 cache 130 might then be configured to store half a megabyte or four megabytes of data that can be accessed in two processor cycles. It should be noted that additional levels of cache memory can be implemented, with each successive memory having greater data capacity and greater data latency.

The conventional memory hierarchy of FIG. 1 is typically used because it provides both a wide memory address space and relatively fast access to the data stored in the memory. When processor 110 needs to access data, it forwards a data access request to L1 cache 120. If the data is currently stored in L1 cache 120, the data is returned by the L1 cache to the processor. If the desired data is not currently stored in L1 cache 120, the L1 cache forwards the data access request to L2 cache 130. If L2 cache 130 currently stores the data, the data is returned by the L2 cache to L1 cache 120, which then forwards the data to processor 110. L1 cache 120 also stores the data returned by L2 cache 130 so that it will be available in the L1 cache if processor 110 makes another access request for this data. If L2 cache 130 does not currently store the desired data, the L2 cache will forward the data access request to main memory 150. Main memory 150 will retrieve the requested data and return the data to L2 cache 130, which will store the data itself and forward the data to L1 cache 120. As noted above, L1 cache 120 will also store the data, and will forward the data to processor 110.
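The request-forwarding behavior described above can be modeled with a short sketch. The class name Level, the trace list and the sample address are hypothetical, and the model deliberately ignores capacity limits, eviction and timing.

```python
class Level:
    """One level of a memory hierarchy (a cache or the main memory)."""

    def __init__(self, name, backing=None, contents=None):
        self.name = name
        self.backing = backing          # next level to ask on a miss
        self.store = dict(contents or {})

    def read(self, address, trace):
        trace.append(self.name)         # record which levels were consulted
        if address in self.store:
            return self.store[address]  # hit: return the data directly
        if self.backing is None:
            raise KeyError(address)
        data = self.backing.read(address, trace)  # miss: forward the request
        self.store[address] = data      # keep a copy for later requests
        return data

main_memory = Level("main", contents={0x100: "payload"})
l2 = Level("L2", backing=main_memory)
l1 = Level("L1", backing=l2)

first_trace, second_trace = [], []
first = l1.read(0x100, first_trace)     # misses in L1 and L2, served by main
second = l1.read(0x100, second_trace)   # now served by L1 alone
```

The first request walks all three levels and leaves a copy in both caches; the repeated request is satisfied by the L1 level without consulting the others.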

Some computer systems use other memory architectures to store data that is used by the systems' processor(s). For example, in the multiprocessor computer system illustrated in FIG. 2, each processor core is designed to access only data which is stored in a local memory associated with that core. As shown in the functional block diagram of FIG. 2, the system includes multiple processing elements (SPE's) 210-230, each of which is connected to a main memory 250 by an internal bus 260. In this system, each of the processing elements is designed to perform a relatively specialized function using instructions and data that can be stored in a relatively small amount of memory space.

Each processing element includes a processor core (SPC), a local memory (LM) and a memory flow control unit (MFC). For example, processing element 210 includes processor core 211, local memory 212 and memory flow control unit 213. The memory flow control unit of each processing element functions as a direct memory access (DMA) engine which transfers data between the corresponding local memory (e.g., 212) and main memory 250. Because each processing element performs a relatively specialized function, the instructions and data necessary to perform a function can typically reside within the local memory of the processing element, which may have a data capacity on the order of 256 kilobytes. The memory flow control unit for the processing element therefore retrieves the necessary instructions and data from main memory 250 and then loads these instructions and data into the local memory for execution and processing by the processor core. If new instructions and/or data are needed by the processor core, the memory flow control unit will transfer unneeded data from the local memory back to main memory 250 if necessary, and will load new instructions and/or data from the main memory into the local memory. These data transfers may be performed in parallel with the execution of other instructions and/or processing of other data by the processor core.

A multiprocessor computer system of the type illustrated in FIG. 2 can provide a great deal of processing power because multiple processor cores are used. It may be more difficult, however, to take advantage of this processing power than to make use of a conventional single-processor system because of the limited amount of local memory space that is available for storing instructions and data that are used by the individual processing elements in the multiprocessor system. While a novice programmer can, with relative ease, produce a program that is efficiently executed by a single-processor computer system supported by a conventional memory hierarchy, it typically requires much greater proficiency to be able to produce a program that is executable within the constraints of the limited local memory of the multiprocessor system. The increased skill level that is required to program a multiprocessor system such as the one illustrated in FIG. 2 may limit the utility of such a system. It would therefore be desirable to provide a computer system that has the increased computational power of the multiprocessor system and the ease-of-programming of the conventional single-processor system.

In order to provide these benefits, one multiprocessor system incorporates a cache memory into each processing element to provide a link between the processor core of the processing element and the main memory. This is illustrated in FIG. 3. FIG. 3 is a block diagram illustrating the structure of a multiprocessor system that includes both a primary processor core that is linked to the main memory using a conventional memory hierarchy, and multiple processing elements that include local memories and cache memories that are linked to the main memory.

In this embodiment, primary processor core 310 is linked to main memory 320 through three levels of cache memories, 331-333 (L1-L3, respectively). Primary processor core 310 accesses data through this conventional memory hierarchy, first accessing L1 cache 331, then L2 cache 332, then L3 cache 333, and finally main memory 320. Processing element 340 is one of a plurality of processing elements in the multiprocessor system. The remaining processing elements are not shown in this figure for purposes of clarity, but these processing elements have structures which are identical to the structure of processing element 340.

Processing element 340 includes a processor core 341, a local memory 342 and a memory flow controller 343 that are essentially as described above in connection with FIG. 2. Processing element 340, however, can access not only data that is stored in local memory 342, but also the data that is stored in a local cache 344 (the SL1 cache) and main memory 320. Thus, while processor core 341 can rapidly access the data stored in local memory 342, it is not limited to accessing this data, which must be loaded by memory flow controller 343 from main memory 320. Processor core 341 can also access the entire memory space that is available to primary processor 310 by forwarding a request for this data to SL1 cache 344.

SL1 cache 344 is coupled to main memory 320 to form a memory hierarchy similar to the one used by primary processor 310, except that the memory hierarchy coupled to processor core 341 has a single level of cache memory, rather than the three levels formed by cache memories 331-333. It should be noted that, in an alternative embodiment, SL1 cache 344 can be coupled to the caches of the primary processor (e.g., to L3 cache 333, as indicated by the dashed line), rather than being directly coupled to main memory 320. In this case, processor core 341 would access main memory 320 through SL1 cache 344 and L3 cache 333. SL1 cache 344 is, in this embodiment, a small cache, storing only 8-32 kilobytes of data. SL1 cache 344 is, however, configured to use the full (e.g., 64-bit) addressing employed by primary processor 310, so that processor core 341 can access all available data in main memory 320.

The SL1 cache illustrated in FIG. 3 can be implemented in a variety of multiprocessor systems. For example, the Cell processor jointly developed by Toshiba, Sony and IBM has eight processing elements into which the SL1 cache can be incorporated. Referring to FIG. 4, a Cell-type multiprocessor system 400 includes a primary processing element 410, and eight specialized processing elements 411-418. Each of processing elements 410-418 is coupled to an internal bus 450 which allows the processing elements to communicate with each other and with other components of the system. Input/output (I/O) interfaces 460 and 461 are coupled to internal bus 450 to allow the processing elements to communicate with other components of the system that are external to the die on which the processing elements are constructed. Primary processing element 410 includes a first-level cache 420, which is coupled to a second-level cache 430, which is in turn coupled to a main memory 440. Each of specialized processing elements 411-418 includes its own local cache (421-428, respectively) which functions in the same manner as the SL1 cache 344 of FIG. 3. Local caches 421-428 couple the respective specialized processing elements to main memory 440.

The systems illustrated in FIGS. 3 and 4 can enable even a novice programmer to write applications for execution by a multiprocessor system. Rather than being confined to the amount of space in the local memory, the programmer can access any available data in the main memory through the SL1 cache. The availability of data through the SL1 cache may also relieve the programmer of the need to program data transfers to and from the local memory using the memory flow controller.

By enabling access to a wider memory space, the addition of the local (SL1) cache facilitates the programming of the specialized processing elements in the Cell-type processor, and thereby makes this multiprocessor system available for use by programmers having a much wider range of skills. In particular, this allows novice programmers greater freedom in programming the specialized processing elements. The greater ease of programming the specialized processing elements opens this type of processor to a greater range of applications and makes available the increased number of processing threads that can be handled by this multiprocessor system. In one embodiment, the primary processing element can execute two threads, while each of the specialized processing elements can process a single thread. The processor can therefore execute ten threads simultaneously. Other systems may use two or even three primary processing elements that can each execute two concurrent threads, but this still allows a maximum of only six threads, in comparison to the ten threads of the Cell-type processor.

One embodiment of the multiprocessor system illustrated in FIGS. 3 and 4 is configured as a SIMD, bi-endian system. In other words, the system can operate on multiple pieces of data (e.g., by adding, shifting or performing other operations on arrays of data words in multiple-word registers), and can also function cooperatively with devices that are either big-endian or little-endian.

“Endian-ness” refers to the ordering of bytes of data or instructions in a computer system. The bytes of a data word may be stored beginning with the most significant byte, or they may be stored beginning with the least significant byte. Conventionally, a big-endian system stores the most significant byte at the lowest address for the byte or word, while a little-endian system stores the least significant byte at the lowest address. It is important in designing a computer system to understand this difference and to be able to convert data, if necessary, between big-endian and little-endian representations.

While the endian-ness of data is well understood and typically handled without unusual difficulty by system designers and programmers, there are some issues related to endian-ness that are less common, but must nevertheless be addressed in order to prevent misinterpretation of data. One such issue involves the use of preferred slots in multiple-byte or multiple-word lines of data.

One embodiment of the system illustrated in FIG. 4 utilizes 128-bit lines of data. Each line of data comprises 16 bytes, or four 32-bit words. Each line may therefore be referred to as a “quadword”. This is a convenient data size because this particular embodiment is configured to process data and instructions as a SIMD processor. The SIMD processor receives a 128-bit line of data and can interpret this data as four separate pieces of information. In order to properly interpret the line of data, the processor must know whether the bytes in the line represent data, control information, or both. This is accomplished by using a “preferred slot”.

A preferred slot is a predetermined slot, or location, within the data line. The processor interprets the data that falls within the preferred slot as scalar data, that is, data that is applied to, or relates to, all data in a 128-bit register. The scalar data may be, for example, a load/store address, a shift amount, a result derived from the other data, etc. The data that does not fall in the preferred slot may be ignored.

For instance, when the shift-word instruction

RT(i) = RA(i) << RB(0)

is executed, a scalar shift value is read from the preferred slot in register B, and each word in register A is shifted by the number of bits specified by the scalar shift value. The shifted words are then stored in a target register (RT).
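The shift-word operation can be modeled as follows. This is a minimal sketch rather than a description of the actual instruction: the function name shift_words_left is hypothetical, registers are modeled as lists of four 32-bit words, results are assumed to be truncated to 32 bits, and the shift amount is assumed to be smaller than the word width.

```python
MASK32 = 0xFFFFFFFF

def shift_words_left(ra, rb, preferred_slot=0):
    """Model RT(i) = RA(i) << RB(preferred_slot) for each 32-bit word i.

    Only the word of rb in the preferred slot is read; the other words of
    rb are ignored. Truncation to 32 bits is an assumption of this sketch.
    """
    amount = rb[preferred_slot]
    return [(word << amount) & MASK32 for word in ra]

ra = [0x00000001, 0x00000002, 0x80000001, 0x000000FF]
rb = [4, 99, 99, 99]   # only rb[0], the preferred slot, is used
rt = shift_words_left(ra, rb)
```

Every word of RA is shifted by the same scalar amount taken from the preferred slot of RB, so the values in the remaining words of RB have no effect on the result.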

The preferred slot changes with the endian-ness of the system configuration. Typically, the location of the preferred slot in a big-endian system is in the most significant word, while the location of the preferred slot in a little-endian system is in the least significant word.

Referring to FIG. 5, a diagram illustrating the locations of preferred slots in a 16-byte data line in a 128-bit register in one embodiment is shown. FIG. 5 depicts six lines of data. Each line of data is 16 bytes (128 bits) long, as indicated by the blocks in each line. Each line shows the preferred slot in which a piece of data (or address) is stored, depending upon its size. From top to bottom, the lines show the preferred slots for a byte, a halfword (two bytes), an address (four bytes), a word (four bytes), a doubleword (eight bytes), and a quadword (16 bytes).

As shown in FIG. 5, the most significant bit of each data line is on the left, while the least significant bit is on the right. Consequently, in a big-endian scheme, the bytes are ordered from left to right. This is indicated by the numerals at the bottom of the figure, which go from 0 on the left (the most significant byte) to 15 on the right (the least significant byte). Byte 0 is stored at the address of the data line, with each successive byte stored at the next data address. Thus, if the data line is stored at address xxx00, the most significant byte is stored at address xxx00, the next most significant byte is stored at xxx01, and so on. In a little-endian scheme, the bytes are ordered from right to left, as indicated by the numerals at the top of the figure. When this scheme is used, the least significant byte is stored as byte 0, and the most significant byte is stored as byte 15. As a result, if the data line is stored at address xxx00, the least significant byte is stored at address xxx00, the next least significant byte is stored at xxx01, and so on.

The preferred slots in the data lines are indicated in FIG. 5 by shading and dotted lines. The preferred slots for a little-endian scheme are shaded, while the preferred slots for a big-endian scheme are indicated by the dotted lines. In the little-endian scheme, the preferred slots are at the lower end (lower address) of the data line, where the least significant bytes are normally found. In this scheme, the preferred slot for a single byte is byte 0. The preferred slot for a half word (two bytes) is bytes 0:1. The preferred slot for an address (32 bits) or a full word (four bytes) is bytes 0:3. The preferred slot for a doubleword (eight bytes) is bytes 0:7. Finally, the preferred slot for a quadword (16 bytes) consists of the entire data line (bytes 0:15).

In a big-endian scheme, the preferred slots begin with the more significant bytes, though not necessarily the most significant bytes, of the data line. It should be noted that the more significant bytes are at the lower addresses in the big-endian scheme, rather than at the higher addresses as in the little-endian scheme. In the big-endian scheme, the preferred slot for a single byte is byte 3. The preferred slot for a half word (two bytes) is bytes 2:3. The preferred slot for an address (32 bits) or a full word (four bytes) is bytes 0:3. The preferred slot for a doubleword (eight bytes) is bytes 0:7. While the preferred slots for the address, word and doubleword all occupy the same byte numbers as in the little-endian scheme, these bytes are located at the more significant end of the data line rather than the less significant end as in the little-endian scheme. The preferred slot for a quadword (16 bytes) consists of the entire data line (bytes 0:15), so there is no difference from the little-endian scheme.

As noted above, the preferred slots in the little-endian scheme all start at the base address of the data line (i.e., they all begin with byte 0), but the preferred slots in the big-endian scheme do not. The preferred slots in the big-endian scheme may begin with bytes 0, 2 or 3, depending upon the size of the preferred slot. It can be seen that, for the preferred slots which fit within a single 32-bit word, the slots are in the same position with respect to the least significant bytes within the word. That is, each of the preferred slots begins with the least significant byte of the word. The words in which the preferred slots are located, however, are transposed from one scheme to the other: while these preferred slots are located in the least significant word in the little-endian scheme, they are in the most significant word in the big-endian scheme. It can also be seen that the doubleword preferred slot is transposed from the least significant double word in the little-endian scheme to the most significant double word in the big-endian scheme.
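The relationship between the two sets of preferred slots can be checked with a short sketch. The byte ranges are those given in the text; the table name PREFERRED, the helper functions and the convention of counting positions from the most significant byte of the line (position 0, the left side of FIG. 5) are assumptions made for illustration.

```python
LINE_BYTES = 16
WORD_BYTES = 4

# Preferred slot byte numbers (first, last) in each scheme, as given in the text.
PREFERRED = {
    "little": {"byte": (0, 0), "halfword": (0, 1), "word": (0, 3),
               "doubleword": (0, 7), "quadword": (0, 15)},
    "big": {"byte": (3, 3), "halfword": (2, 3), "word": (0, 3),
            "doubleword": (0, 7), "quadword": (0, 15)},
}

def positions_from_left(scheme, size):
    """Positions of the slot counted from the most significant byte (0)."""
    first, last = PREFERRED[scheme][size]
    numbers = range(first, last + 1)
    if scheme == "big":
        return sorted(numbers)                      # byte n sits at position n
    return sorted(LINE_BYTES - 1 - n for n in numbers)  # numbering is mirrored

def words_and_offsets(scheme, size):
    """Which 32-bit words the slot touches, and which bytes within a word."""
    pos = positions_from_left(scheme, size)
    words = sorted({p // WORD_BYTES for p in pos})
    offsets = sorted({p % WORD_BYTES for p in pos})
    return words, offsets

for size in ("byte", "halfword", "word", "doubleword"):
    print(size, words_and_offsets("big", size), words_and_offsets("little", size))
```

For the byte, halfword and word slots, the offsets within the word are identical in the two schemes, while the word index changes from 0 (most significant word) to 3 (least significant word), which is the transposition described above.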

Because the positions of the preferred slots may change in a bi-endian system depending upon whether a big-endian or a little-endian scheme is used, it is necessary to provide some means to handle data so that the preferred slot is properly recognized and interpreted, and so that data which is destined for the preferred slot is stored in the proper location. Referring to FIG. 6, a functional block diagram illustrating the structure of a processor core within a bi-endian multiprocessor system in accordance with the prior art is shown. This processor core is designed to use several multiplexers to read data from or store data to either a position associated with a big-endian preferred slot, or a position associated with a little-endian preferred slot.

The processor core of FIG. 6 includes a load/store unit 610, a register file 620 and an arithmetic logic unit 630. These components of the processor core perform the desired operations on the data that is processed by the processor core. Multiplexers 640-643 are used to select the appropriate preferred slot positions based upon the mode (big-endian or little-endian) in which the processor core is operating.

It can be seen that load/store unit 610, register file 620 and arithmetic logic unit 630 each include a set of dotted lines which divide the respective components into four portions. These dotted lines do not represent a physical subdivision of the components, but are instead used to show that each component can process one line of data containing four 32-bit words concurrently. Load/store unit 610 retrieves a line of data from memory (e.g., a local memory) and provides this data to register file 620. Register file 620 temporarily stores the data so that it can be accessed by arithmetic logic unit 630. Data which is processed (or generated) by arithmetic logic unit 630 may then be output or stored in register file 620, from which load/store unit 610 may store the data in the local memory associated with the processor core.

As noted above, the preferred slot in the line of data may be in either the first word or the last word of the line, depending upon whether the processor core is operating in a big-endian mode or a little-endian mode. The processor core must therefore be able to properly handle the data that is in the preferred slots. This is accomplished by including several multiplexers in the design. More specifically, these multiplexers must be included wherever a preferred slot is used. The multiplexers select between the big-endian preferred slot position and the little-endian preferred slot position. As illustrated in FIG. 6, multiplexer 640 is used to select one of the preferred slot positions in register file 620 so that the preferred slot data can be provided to load/store unit 610. Multiplexer 641 is used to select the proper preferred slot position in arithmetic logic unit 630 so that the appropriate data can be used as shift amount values within the arithmetic logic unit. Multiplexers 642 and 643 are used to select the proper preferred slot positions with respect to the output of arithmetic logic unit 630.

While the multiplexers used in conventional bi-endian systems serve the desired purpose, there are some drawbacks to this solution. For instance, because it takes time for the selected data to propagate through the multiplexers and to be provided to the appropriate logic circuits, the multiplexers increase the delay in the affected data paths. Because the multiplexers are typically within critical paths that define the cycle time for the system, the increased delays may reduce the maximum operating frequency of the system. Still further, the additional circuitry of the multiplexers and corresponding interconnects requires additional area on the integrated circuit die, which increases the expense of manufacturing the system.

The various embodiments of the present invention may reduce or eliminate these shortcomings by eliminating the need for the multiplexers that are used to select the appropriate preferred slot position. This is accomplished by swapping, if necessary, the positions of the words within each data line when the data line is transferred between the main memory and the local memory of an individual processing element. In one embodiment, the processing element is configured to operate on data using a little-endian representation. When the processing element operates in conjunction with a device that also uses a little-endian data representation, data is transferred from the main memory to the local memory without any changes to the data. When the processing element operates in conjunction with a device that uses a big-endian data representation, the words of each data line transferred to the local memory are transposed, thereby moving the preferred slot data from the big-endian preferred slot position to the little-endian preferred slot position. When data is transferred from the local memory back to the main memory, the words of each data line are again transposed, so that the preferred slot data is moved from the little-endian preferred slot position to the big-endian preferred slot position.

Referring to FIG. 7, a functional block diagram illustrating the structure of a processing element in accordance with one embodiment of the invention is shown. This figure depicts a processor core such as the ones illustrated in FIGS. 3 and 4, as well as the local memory and a memory flow controller that transfers data between the local memory and a main memory.

The structure of the processor core in FIG. 7 is similar to the structure of the prior art processor core of FIG. 6, in that it includes a load/store unit 710, a register file 720 and an arithmetic logic unit 730. The processor core of FIG. 7, however, does not incorporate multiplexers for the purpose of selecting different preferred slot positions. Whenever a preferred slot is used, the preferred slot is always in the same position (i.e., the preferred slot is in the first, or left-most, word of the data line). The data in the preferred slot can therefore be provided directly to various logic circuits (e.g., provided to load/store unit 710 or used to shift the data within arithmetic logic unit 730) without having to propagate through any multiplexers. This avoids the additional delays, potential frequency reductions, and additional circuitry and expense of the prior art system.

Referring to FIG. 8, a pair of diagrams illustrating the modes of operation of the memory flow controller is shown. The diagram on the left side of the figure illustrates the simple transfer of data between main memory 760 and local memory 740 in a first mode of operation. This mode is used when the endian-ness of the main memory is the same as the endian-ness of the local memory and processor core. In the figure, both main memory and local memory are shown as using a big-endian representation of data, but the data transfer operation remains the same if both use a little-endian representation. When the memory flow controller is operating in this first mode, the data word (bits 96:127) which occupies the first word position in the data line in the main memory also occupies the first word position in the data line when it is stored in the local memory. Likewise, the data words in the second, third and fourth positions in the main memory also occupy the second, third and fourth positions, respectively, in the local memory.

The diagram on the right side of the figure illustrates the transposition of data words when data lines are transferred between main memory 760 and local memory 740 in a second mode of operation. This mode is used when the endian-ness of the main memory is different from the endian-ness of the local memory and processor core. In this diagram, the main memory is shown as using a little-endian data representation, while the local memory is shown as using a big-endian data representation. Alternatively, the main memory could use a big-endian representation while the local memory uses a little-endian representation. When the memory flow controller is operating in this second mode, the data words of each data line are transposed as they are transferred between the main memory and the local memory. In other words, the first data word in the line stored in the main memory is stored in the local memory as the last data word in the line. Similarly, the second, third and fourth data words in the main memory are stored as the third, second and first words, respectively, in the local memory.
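The two modes of operation described with reference to FIG. 8 can be sketched in software as follows. This Python model is illustrative only; the names select_transpose_mode and transfer_line are hypothetical, and a data line is represented as a list of four 32-bit word values rather than as hardware signals.

```python
def select_transpose_mode(local_is_big_endian, source_is_big_endian):
    """Return True for the second mode (transposing), False for the first mode.

    The second mode is used only when the local memory and the data
    source have different endian-ness.
    """
    return local_is_big_endian != source_is_big_endian


def transfer_line(words, transpose):
    """Return the line as it is stored at the destination.

    First mode: word order is preserved.
    Second mode: the word order is reversed, so words 1, 2, 3, 4 of the
    source line become words 4, 3, 2, 1 of the destination line.
    The contents of each individual word are never altered.
    """
    words = list(words)
    return words[::-1] if transpose else words


main_memory_line = [0xA0, 0xB1, 0xC2, 0xD3]

# Little-endian main memory, big-endian local memory: second mode.
mode = select_transpose_mode(local_is_big_endian=True, source_is_big_endian=False)
local_line = transfer_line(main_memory_line, mode)

# Writing the line back applies the same transposition again,
# which restores the original word order in main memory.
written_back = transfer_line(local_line, mode)
```

Because reversing the word order twice yields the original order, the same operation serves for transfers in both directions.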

It should be noted that, while the memory flow controller transposes (when necessary) the words in the data lines, the bits within the words (or doublewords, or quadwords) are not transposed. The potential need for transposition of bits is known in the art and can be implemented as needed by programmers or system designers using conventional techniques. It should also be noted that transposition of the bits in a 128-bit line, 64-bit doubleword or 32-bit word does not necessarily result in the same preferred slot positions as the transposition of words described herein.
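The distinction drawn above can be illustrated with a short Python sketch. The example works at byte granularity, which is an assumption made for illustration (the description refers to bits within words), and the helper names are hypothetical. It contrasts the word transposition performed by the memory flow controller with a conventional reordering applied inside each 32-bit word.

```python
WORD_BYTES = 4  # a 32-bit word


def split_words(line):
    """Split a 16-byte line into four 4-byte words."""
    return [line[i:i + WORD_BYTES] for i in range(0, len(line), WORD_BYTES)]


def transpose_words(line):
    """Reverse the order of the words; each word's contents are untouched."""
    return b"".join(reversed(split_words(line)))


def byteswap_each_word(line):
    """Reverse the bytes inside each word; the word positions are untouched."""
    return b"".join(word[::-1] for word in split_words(line))


line = bytes(range(16))  # bytes 0x00 through 0x0F

transposed = transpose_words(line)    # first word is now bytes 0x0C..0x0F
swapped = byteswap_each_word(line)    # first word is now bytes 0x03..0x00
```

In this sketch the word that starts in the first position moves to the last position under word transposition but stays in the first position under the per-word reordering, which is one way the two operations can yield different preferred slot positions.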

As noted above, the processing elements may incorporate their own cache memories (the SL1 caches) to allow the processing elements access to a wider memory space than is available in their respective local memories. The SL1 caches may be configured to function in a manner similar to that of the memory flow controllers, so that data accesses to the main memory (or other levels of cache memories) involve either straight transfers or transpositions of data words, depending upon the selected mode of operation.

In the description above, data is transferred either to or from the main memory. It should be noted that the processing elements may be able to access data in or through components other than the main memory. For instance, the virtual address space accessible to a processing element may include its local memory, the main memory, video memory, input/output devices (memory mapped I/O), and so on. In an alternative embodiment of the invention, the processing element can selectively access these data sources using the first and second modes described above. A single selected mode can be applied to all of the data sources, or the mode can be selected independently for each of the data sources, depending upon the endian-ness of the respective data sources.

In one embodiment, the system is configured to select the appropriate operating mode at startup. This selection may be made in response to detecting that the data source to be accessed (e.g., the main memory) has a particular endian-ness, or the selection may be made manually.

While the transposition of words within a data line as described above does not affect operations on individual 32-bit words, such as 32-bit adds, it may affect other operations. For example, a doubleword add is affected, because the two halves of each doubleword are reversed. Normally, a carry bit from the least significant word is added to the most significant word in the doubleword. If the words within the doublewords are reversed, however, the carry bit is carried from the most significant word to the least significant word, resulting in an error. It is therefore necessary to incorporate modifications to the logic of the processor core to account for the potential reversal of the halves of the doublewords.
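The error described above can be reproduced with a small numerical example. The following Python sketch models a doubleword adder whose carry direction is fixed, and applies it first to doublewords in their normal word order and then to doublewords whose halves have been reversed. The helper names and the chosen operand values are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def to_words(value):
    """Split a 64-bit value into (most significant word, least significant word)."""
    return ((value >> 32) & MASK32, value & MASK32)


def from_words(words):
    """Rebuild a 64-bit value from (most significant word, least significant word)."""
    return (words[0] << 32) | words[1]


def swap(words):
    """Reverse the two halves of a doubleword, as word transposition does."""
    return (words[1], words[0])


def add_dword_fixed_carry(a, b):
    """Add two doublewords; the carry always flows from position 1 to position 0."""
    low = a[1] + b[1]
    carry = low >> 32
    high = (a[0] + b[0] + carry) & MASK32
    return (high, low & MASK32)


x = 0x00000000FFFFFFFF
y = 0x0000000000000001

# Normal word order: the carry out of the low words reaches the high words.
correct = from_words(add_dword_fixed_carry(to_words(x), to_words(y)))

# Reversed halves: the carry out of the low words (now in position 0) is lost.
reversed_sum = add_dword_fixed_carry(swap(to_words(x)), swap(to_words(y)))
wrong = from_words(swap(reversed_sum))
```

Here correct equals x + y, while wrong does not, because the fixed carry path no longer runs from the least significant word to the most significant word.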

Referring to FIG. 9, a functional block diagram of a logic circuit configured to properly carry a bit from the least significant word to the most significant word in the doubleword is shown. Conventionally, the least significant words of two doublewords are added, and a carry bit from the least significant words is added, if necessary, to the most significant words of the two doublewords. Because the positions of the least significant and most significant words within each doubleword are dependent upon whether the data words have been transposed, it is necessary to provide a mechanism to add the carry bit to the appropriate halves of the doublewords. An exemplary mechanism is illustrated in FIG. 9.

FIG. 9 shows four 32-bit adders (910-913). Each of adders 910 and 912 receives 32-bit words from registers RA0 and RB0. These registers store the first halves of the doublewords that are being added. The first halves of the doublewords may be either the least significant or most significant words, depending upon whether the words within the data lines have been transposed. Each of adders 911 and 913 receives 32-bit words from registers RA1 and RB1, which store the second halves of the doublewords that are being added. If registers RA0 and RB0 store the least significant words of the doublewords, registers RA1 and RB1 store the most significant words of the doublewords, and vice-versa.

While adders 910 and 912 both receive and add the same two 32-bit words, adder 912 also adds an extra bit that represents a carry bit. Adder 910 does not add this extra bit to the values in registers RA0 and RB0. Similarly, adders 911 and 913 both receive and add the same 32-bit words (from registers RA1 and RB1), but adder 913 also adds an extra bit, while adder 911 does not. Each of the pairs of adders (910/912 and 911/913) provides its output to a corresponding multiplexer which selects the appropriate one of the adders' outputs based upon the mode in which the system is operating.

The sums computed by adders 910 and 912 are provided as inputs to multiplexer 920. The sums computed by adders 911 and 913 are provided as inputs to multiplexer 921. Multiplexer 920 receives a control input from AND gate 930, while multiplexer 921 receives its control input from AND gate 931. AND gate 930 receives as inputs a carry bit from adder 911 and a control signal (BE) that is asserted (high) when the main memory and local memory have the same endian-ness, for example, when the main memory or other data source uses a big-endian data representation and the processor core uses a big-endian data representation. AND gate 931 receives as inputs a carry bit from adder 910 and a control signal (LE) that is asserted (high) when the main memory and local memory have different endian-ness, for example, when the main memory or other data source uses a little-endian data representation and the processor core uses a big-endian data representation.

In operation, only one of the BE and LE control signals will be asserted (high in this embodiment). The other of these control signals will be de-asserted (low in this embodiment). If the LE signal is asserted, this indicates that the data source uses a little-endian data representation (while the processor core uses big-endian data representation). In this case, registers RA0 and RB0 will contain the least significant words of the doublewords that are being added, and registers RA1 and RB1 will contain the most significant words. Consequently, no carry bit should be added to the values in registers RA0 and RB0, but a carry bit should be added to the values in registers RA1 and RB1 if one is generated by the addition of the least significant word values in adder 910. Since the LE control signal is high, the carry bit is simply passed through AND gate 931. If the carry bit is 0, this input to multiplexer 921 will cause the multiplexer to select adder 911, which produces the sum of the most significant words without an extra bit. If the carry bit is 1, this input to multiplexer 921 will cause the multiplexer to select adder 913, which produces the sum of the most significant words with an extra bit. Because control signal BE is low, any carry bit generated by adder 911 is blocked by AND gate 930, so the output of the AND gate is 0 and the output of adder 910 is selected.

The BE control signal is asserted when the data source uses a big-endian data representation and the processor core uses a big-endian data representation. In this case, registers RA0 and RB0 will contain the most significant words of the doublewords that are being added, and registers RA1 and RB1 will contain the least significant words. Consequently, a carry bit generated by the addition of the values in registers RA1 and RB1 should be carried to the addition of the values in registers RA0 and RB0, but no carry bit should be added to the values in registers RA1 and RB1. Since control signal BE is high, a carry bit generated by adder 911 will be passed through AND gate 930 to multiplexer 920. If the carry bit is 0, multiplexer 920 will select the output of adder 910, which does not add an extra bit to the values in registers RA0 and RB0. If, on the other hand, the carry bit is 1, multiplexer 920 will select the output of adder 912, which does add an extra bit to the words in registers RA0 and RB0. Because control signal LE is low, any carry bit generated by adder 910 is blocked by AND gate 931, so the output of the AND gate is 0 and the output of adder 911 is selected.
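The operation of the circuit of FIG. 9 in both cases can be summarized by a behavioral model. The following Python sketch is a software approximation only, not a description of the actual circuitry. The function names add32 and fig9_add are hypothetical, and the BE and LE control signals are represented as the integers 0 and 1.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, extra_bit=0):
    """Model a 32-bit adder: return (32-bit sum, carry out)."""
    total = a + b + extra_bit
    return total & MASK32, total >> 32


def fig9_add(ra0, rb0, ra1, rb1, be, le):
    """Behavioral model of the doubleword adder of FIG. 9.

    Returns (result for the first halves, result for the second halves).
    Exactly one of be and le is expected to be asserted.
    """
    if bool(be) == bool(le):
        raise ValueError("exactly one of BE and LE must be asserted")

    sum910, carry910 = add32(ra0, rb0)     # adder 910: first halves, no extra bit
    sum912, _ = add32(ra0, rb0, 1)         # adder 912: first halves plus extra bit
    sum911, carry911 = add32(ra1, rb1)     # adder 911: second halves, no extra bit
    sum913, _ = add32(ra1, rb1, 1)         # adder 913: second halves plus extra bit

    select920 = carry911 & int(bool(be))   # AND gate 930
    select921 = carry910 & int(bool(le))   # AND gate 931

    out0 = sum912 if select920 else sum910  # multiplexer 920
    out1 = sum913 if select921 else sum911  # multiplexer 921
    return out0, out1


# BE asserted: RA0/RB0 hold the most significant words.
# 0x00000001_FFFFFFFF + 0x00000000_00000001 = 0x00000002_00000000
be_result = fig9_add(0x00000001, 0x00000000, 0xFFFFFFFF, 0x00000001, be=1, le=0)

# LE asserted: the same doublewords with their halves reversed.
le_result = fig9_add(0xFFFFFFFF, 0x00000001, 0x00000001, 0x00000000, be=0, le=1)
```

In the model all four sums are computed before the carries are known, and the carries are used only to choose between sums that are already available, mirroring the selection performed by multiplexers 920 and 921.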

As pointed out above, the doubleword adder of FIG. 9 assumes that the processor core in which the adder is implemented uses a big-endian data representation. In alternative embodiments, the processor core may be configured to use a little-endian data representation, in which case the doubleword adder may be modified to account for this difference. It should also be noted that the logic circuitry of FIG. 9 may be implemented in alternative embodiments using different components than those shown in the figure. For instance, the doubleword adder could be implemented using only two 32-bit adders, with the carry bit from each adder being AND'ed with the appropriate one of the LE and BE control signals, and the result being added to the values summed in the other adder. It is contemplated, however, that such an implementation would not allow the processor core to operate at as high a speed as the implementation illustrated in FIG. 9.
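The two-adder alternative mentioned above can likewise be sketched as a behavioral model. In this Python sketch, which is illustrative only, the function names are hypothetical and the control signals are represented as 0 and 1. The ordering of the two additions is an assumption about how the gated carry would propagate: the adder holding the least significant words is evaluated first, and its gated carry is then supplied to the other adder.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """Model a 32-bit adder: return (32-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32


def two_adder_add(ra0, rb0, ra1, rb1, be, le):
    """Behavioral model of a doubleword adder built from two 32-bit adders.

    Returns (result for the first halves, result for the second halves).
    Exactly one of be and le is expected to be asserted.
    """
    if bool(be) == bool(le):
        raise ValueError("exactly one of BE and LE must be asserted")

    if le:
        # First halves are least significant: add them first, then
        # feed the carry (gated by LE) into the second-half adder.
        out0, carry0 = add32(ra0, rb0)
        out1, _ = add32(ra1, rb1, carry0 & 1)
    else:
        # Second halves are least significant: add them first, then
        # feed the carry (gated by BE) into the first-half adder.
        out1, carry1 = add32(ra1, rb1)
        out0, _ = add32(ra0, rb0, carry1 & 1)
    return out0, out1


be_result = two_adder_add(0x00000001, 0x00000000, 0xFFFFFFFF, 0x00000001, be=1, le=0)
le_result = two_adder_add(0xFFFFFFFF, 0x00000001, 0x00000001, 0x00000000, be=0, le=1)
```

In this model the second addition cannot begin until the first has produced its carry. That serialization is presumably one reason such an implementation would be slower than the arrangement of FIG. 9, in which the sums with and without the extra bit are computed in parallel.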

While the disclosure of the present application discusses the invention in the context of multi-processor computing systems, it should be noted that the invention is more widely applicable and can be used in a variety of other contexts. Consequently, the disclosure should not be considered as limiting the invention to the field of multimedia game systems.

The benefits and advantages which may be provided by the present invention have been described above with regard to specific embodiments. These benefits and advantages, and any elements or limitations that may cause them to occur or to become more pronounced, are not to be construed as critical, required, or essential features of any or all of the claims. As used herein, the terms “comprises,” “comprising,” or any other variations thereof, are intended to be interpreted as non-exclusively including the elements or limitations which follow those terms. Accordingly, a system, method, or other embodiment that comprises a set of elements is not limited to only those elements, and may include other elements not expressly listed or inherent to the claimed embodiment.

1. A device comprising: a memory flow controller having a direct memory access (DMA) engine; wherein the DMA engine is configured to transfer data between a local memory dedicated to a processing element and a data source external to the processing element; wherein the DMA engine is configured to transfer multiple-word lines of data; and wherein the DMA engine is configured to operate alternately in either a first mode or a second mode, wherein in the first mode, the DMA engine transfers each line of data without transposing the words in each line of data, and wherein in the second mode, the DMA engine transposes the words in each line of data when the DMA engine transfers the line of data; wherein the device further comprises a processing element configured to receive lines of data transferred by the DMA engine to the local memory, wherein the processing element includes logic circuitry configured to add doublewords in the lines of data, wherein the logic circuitry is configured to carry bits from the first words of the doublewords to the second words of the doublewords in response to determining that the data source implements a little-endian data representation, and wherein the logic circuitry is configured to carry bits from the second words of the doublewords to the first words of the doublewords in response to determining that the data source implements a big-endian data representation.
 2. The device of claim 1, wherein the DMA engine is configured to operate in the first mode when the local memory and the data source have the same endian-ness and wherein the DMA engine is configured to operate in the second mode when the local memory and the data source have different endian-ness.
 3. The device of claim 2, wherein the local memory implements a little-endian data representation and the data source implements a big-endian data representation.
 4. The device of claim 2, wherein the local memory implements a big-endian data representation and the data source implements a little-endian data representation.
 5. The device of claim 1, wherein each line of data comprises 128 bits and each word comprises 32 bits.
 6. The device of claim 1, wherein the DMA engine is configured to transfer each line of data without transposing a bit-order of each word within each line of data.
 7. The device of claim 1, further comprising a processing element configured to receive lines of data transferred by the DMA engine to the local memory, wherein the processing element is configured to utilize data within preferred slots in one or more of the lines of data as scalar data which is used to process data of other operands in the respective lines of data, and wherein the locations of the preferred slots are determined without regard to whether the local memory and the data source implement data representations that have the same endian-ness or different endian-ness.
 8. A system comprising: a plurality of processing elements, wherein each processing element includes a processor core, a local memory dedicated for use by the processor core, and a memory flow controller configured to transfer data; and one or more data sources external to the processing elements; wherein for each processing element, the memory flow controller is configured to transfer multiple-word lines of data between the processing element and the one or more data sources, wherein the memory flow controller is configured to operate alternately in either a first mode or a second mode, wherein in the first mode, the memory flow controller transfers each line of data without transposing the words in each line of data, and wherein in the second mode, the memory flow controller transposes the words in each line of data when the memory flow controller transfers the line of data; and wherein each processing element includes logic circuitry configured to add doublewords in the lines of data, wherein the logic circuitry is configured to carry bits from the first words of the doublewords to the second words of the doublewords in response to determining that the data source implements a little-endian data representation, and wherein the logic circuitry is configured to carry bits from the second words of the doublewords to the first words of the doublewords in response to determining that the data source implements a big-endian data representation.
 9. The system of claim 8, wherein each of the processing elements further comprises a local cache memory coupled between the processor core and a main memory external to the processing elements.
 10. The system of claim 8, wherein the memory flow controller is configured to operate in the first mode when the local memory and the data source have the same endian-ness and wherein the DMA engine is configured to operate in the second mode when the local memory and the data source have different endian-ness.
 11. The system of claim 10, wherein the local memory implements a little-endian data representation and the one or more data sources implement a big-endian data representation.
 12. The system of claim 10, wherein the local memory implements a big-endian data representation and the one or more data sources implement a little-endian data representation.
 13. The system of claim 8, wherein each line of data comprises 128 bits and each word comprises 32 bits.
 14. The system of claim 8, wherein the memory flow controller is configured to transfer each line of data without transposing a bit-order of each word within each line of data.
 15. A method for maintaining preferred slot positions in data lines that are transferred between a processing element and a data source external to the processing element, the method comprising: determining whether the processing element and the data source use big-endian or little-endian data representations; selecting either a first mode of operation or a second mode of operation, wherein the first mode of operation is selected in response to determining that the processing element and the data source use the same data representation, and wherein the second mode of operation is selected in response to determining that the processing element and the data source use different data representations; transferring lines of data between the processing element and the data source, wherein each data line includes multiple data words, wherein each line of data is transferred without transposing the words therein in response to selection of the first mode of operation, and wherein the words in each line of data are transposed when the line of data is transferred in response to selection of the second mode of operation, and reading scalar data from and writing scalar data to preferred slots in one or more of the lines of data, wherein the locations of the preferred slots are determined without regard to whether the processing element and the data source implement data representations that have the same endian-ness or different endian-ness.
 16. The method of claim 15, wherein the processing element implements a little-endian data representation and the data source implements a big-endian data representation.
 17. The method of claim 15, wherein the processing element implements a big-endian data representation and the data source implements a little-endian data representation.
 18. The method of claim 15, wherein each line of data comprises 128 bits and each word comprises 32 bits.
 19. The method of claim 15, wherein transferring lines of data between the processing element and the data source is performed without transposing a bit-order of each word within each line of data. 